{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(data-quickstart)=\n",
    "# Data Analysis Quickstart\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Here we'll do a whistlestop tour of data analysis on a *single* dataframe using the Star Wars' characters dataset as an example. For a more thorough grounding in using data, see the next chapter.\n",
    "\n",
    "This chapter uses the **pandas**, **pyarrow**, and **numpy** packages. If you're running this code, you may need to install these packages, which you can do using either `conda install packagename` or `pip install packagename` on your computer's command line. (If you're not sure what a command line or terminal is, take a quick look at the basics-of-coding chapter.)\n",
    "\n",
    "This chapter is hugely indebted to the fantastic [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), and both the **pandas** [documentation](https://pandas.pydata.org/docs/user_guide/index.html) and amazing [introductory tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html).\n",
    "\n",
    "## Loading data and checking datatypes\n",
    "\n",
    "Loading data into a dataframe is achieved with commands like `df = pd.read_csv(...)` or `df = pd.read_stata(...)`. Let's load the Star Wars data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from pathlib import Path\n",
    "\n",
    "# Set seed for reproducibility\n",
    "np.random.seed(10)\n",
    "# Set max rows displayed for readability\n",
    "pd.set_option(\"display.max_rows\", 6)\n",
    "# Plot settings\n",
    "plt.style.use(\n",
    "    \"https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt\"\n",
    ")\n",
    "\n",
    "df = pd.read_csv(\n",
    "    \"https://github.com/aeturrell/coding-for-economists/raw/main/data/starwars.csv\",\n",
    "    index_col=0,\n",
    ")\n",
    "# Check info about dataframe\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Look at the first few rows with `head()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter rows and columns with conditions using `df.loc[condition(s) or row(s), column(s)]`\n",
    "\n",
    "`.loc` stands for location and allows you to filter (aka subset) a dataframe. `.loc` works like an index, so it always comes with square brackets, eg `df.loc[...]`.\n",
    "\n",
    "`loc` takes two arguments. The first is a list of the names of the rows that you'd like to select *or* a condition (ie a list of booleans with the same length as the dataframe) that selects certain rows. Remember, you can easily create a series of booleans by checking a column against a condition, for example `df['column1'] == 'black'`.\n",
    "\n",
    "The second argument consists of a list of column names you'd like to select. In both cases, `:` is shorthand for 'use all rows' or 'use all columns'. If you have *either* condition(s) *or* column(s) (but not both), you can simply write `df[condition(s)]` or `df[column(s)]`.\n",
    "\n",
    "Here's an example with a condition built up out of two parts and a list of columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[(df[\"hair_color\"] == \"brown\") & (df[\"eye_color\"] == \"blue\"), [\"name\", \"species\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sort rows or columns with `.sort_values()`\n",
    "\n",
    "Use `sort_values(columns, ascending=False)` for descending order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.sort_values([\"height\", \"mass\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Choose multiple rows or columns using slices\n",
    "\n",
    "Slices can be passed by name using `.loc[startrow:stoprow:step, startcolumn:stopcolumn:step]` or by position using `.iloc[start:stop:step, start:stop:step]`.\n",
    "\n",
    "Choosing every 10th row from the second, and the columns between 'name' and 'gender':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[2::10, \"name\":\"gender\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `loc` only works here with numbers for rows because it just so happens that the names of the rows *are* numbers. If the rows had names that were strings, and we wanted to subset rows by their index position, we would have to use `iloc` instead.\n",
    "\n",
    "Choosing the first 5 rows and the last 2 columns by index position:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[:5, -2:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Randomly selecting a sample using `.sample`\n",
    "\n",
    "`.sample(n)` randomly selects `n` rows, `.sample(frac=0.4)` selects 40% of the data, `replace=True` samples with replacement, and passing `weights=` selects a number or fraction with the probabilities given by the passed weights. (Note that weights passed should have the same length as the dataframe.)\n",
    "\n",
    "Taking a sample of 5 rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rename with `.rename`\n",
    "\n",
    "You can rename all columns by passing a function, for instance `df.rename(columns=str.lower)` to put all columns in lower case. Alternatively, use a dictionary to say which columns should be mapped to what:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.rename(columns={\"homeworld\": \"home_world\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add new columns with `.assign` or assignment\n",
    "\n",
    "Very often you will want to create new columns based on existing columns.\n",
    "\n",
    "![](https://pandas.pydata.org/docs/_images/05_newcolumn_1.svg)\n",
    "\n",
    "`.assign` is a function that applies to a dataframe and returns a copy of that dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.assign(height_m=df[\"height\"] / 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This was added to the end; ideally, we'd like it next to the height column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(df.assign(height_m=df[\"height\"] / 100).sort_index(axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To overwrite existing columns just use `height = df['height']/100` in the assign function. We can also assign using a series of statements instead of chaining them together using\n",
    "\n",
    "```python\n",
    "df['height_m'] = df['height']/100\n",
    "df = df.sort_index(axis=1)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summarise numerical values with `.describe()`\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Group variables values with `.groupby()`\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(\"species\")[[\"height\", \"mass\"]].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add transformed columns using `.transform()`\n",
    "\n",
    "Quite often, it's useful to put a column into a dataframe that is the result of an intermediate groupby and aggregation. For example, subtracting the group mean or normalisation. Transform does this and returns the column with the same shape as the original dataframe.\n",
    "\n",
    "Below is an example of transform being used to demean a variable according to the mean by species. Note that we use `pd.notna` to filter out missing species values (otherwise this would result in an error, which is actually *helpful* behaviour) and we are using lambda functions. Lambda functions are a quick way of writing functions, e.g. `lambda x: x+1` defines a function that adds one to x."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First take a cut of the dataframe with no NA values in the species column:\n",
    "df_cut = df[pd.notna(df[\"species\"])]\n",
    "# Now create a demeaned mass col\n",
    "(\n",
    "    df_cut.assign(\n",
    "        mass_demean_species=df_cut.groupby(\"species\")[\"mass\"].transform(\n",
    "            lambda x: x - x.mean()\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make quick charts with `.plot.*`\n",
    "\n",
    "Including scatter, area, bar, box, density, hexbin, histogram, kde, and line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot.scatter(\"mass\", \"height\", alpha=0.5);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot.box(\"species\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"height\"].plot.kde(bw_method=0.3);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export results and descriptive statistics to latex with `to_latex`\n",
    "\n",
    "You'll often want to export your results to a latex file for inclusion in a paper, presentation, or poster. Let's say we had some descriptive statistics on a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table = df[[\"mass\", \"height\"]].agg([np.mean, np.std])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(table.to_latex(caption=\"A Table\", label=\"tab:descriptive\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Writing to the terminal isn't that useful for getting your paper done! To export to a file, use `table.to_latex('file.tex', ...)`."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "codeforecon",
   "language": "python",
   "name": "codeforecon"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
